[57] ABSTRACT

A document storage and retrieval system is provided with 
means for storing a document body in the form of image, 
means for storing text information in the form of a character 
code string for retrieval, means for executing a retrieval with 
reference to the text information, and means for displaying 
a document image relating thereto on a retrieval terminal 
according to the retrieval result. Such a form of the system
is available for retrieving the full contents of a document and 
also for displaying the document body printed in a format 
easy to read straight in the form of image. Accordingly, users 
are capable of retrieving documents with arbitrary words
and also capable of reading even such a document as is 
complicated to include mathematical expressions and charts 
through a terminal in the form of image, the same as on 
paper. Further, the invention provides a system wherein the 
text information for retrieval is extracted automatically from 
the document image through character recognition. Since a 
precision of the character recognition has not been satisfac- 
tory hitherto, a visual retrieval and correction have been 
carried out without fail by operators. However, there is no 
necessity for the operators to attend therefor according to the 
invention. Thus, the text information for retrieval can be 
generated at the cost of practical time and money even in 
case of volumes of documents. 

11 Claims, 16 Drawing Sheets 
DOCUMENT STORAGE AND RETRIEVAL
SYSTEM FOR STORING AND RETRIEVING 
DOCUMENT IMAGE AND FULL TEXT DATA 

This is a divisional of application Ser. No. 07/139,781,
filed Dec. 30, 1987, now U.S. Pat. No. 5,265,242 issued on
Nov. 23, 1993, which is a divisional of parent application Ser.
No. 06/894,855, filed Aug. 8, 1986, now abandoned, which
was continued as Continuation application Ser. No. 07/559,994,
filed Jul. 30, 1990, which issued as U.S. Pat. No.
4,985,863.

BACKGROUND OF THE INVENTION 

The present invention relates to a document storage and
retrieval system for filing documents as an image, and is
particularly concerned with a document storage and retrieval
system capable of full-text searching.

The typical information retrieval system has hitherto
provided a retrieval of data chiefly according to a keyword
and a classification code. Bibliographic information and
patent information have been processed to form a data base
by means of the system mentioned above. Mainly bibliographic
information, including abstracts in its coverage, is
processed for a data base here, but the situation is such that
only a part of its function is realized to cope with the true
need of information retrieval. That is, even if a document or
patent conceivably relevant is found, there is the need to
search among a lot of bookshelves to obtain the text.

Meanwhile, an optical disk capable of storing mass data
has now become available for loading the text in the data base
to provide a so-called original document information
service, thus coping with a social need. A paperless
documentation at the Patent Office is so planned accordingly. In
these systems, volumes of documents are stored in optical
disks in the form of image data, and a conventional
information retrieval technique based mainly on a keyword
search is applied.

However, the conventional information retrieval technique
is only effective in narrowing the candidates down to the
order of tens to hundreds of documents, and hence a further
technique for squeezing relevant documents to 1/10 in number
or so is desired. One method is that in which an original
document (text) stored as image data is called onto a
terminal and read visually by a retriever. The method is
secure in principle; however, documents amounting to hundreds
at maximum are too many to read out in the form of image
data, and reading them one by one visually is not efficient
in practice.

On the other hand, the conventional method based on the
keyword and classification code must be updated all the time,
for the classification system itself changes as time passes,
thus leaving an intrinsic problem. For example, volumes of
documents classified already cannot practically be
reclassified when the classification system is modified later.
Documents and patents recording a progress of science and
technology are of value for their content because they
provide a new conception which often is not included
in the conventional classification system. For this reason, it
is impossible to define beforehand a keyword and a
classification system that represent an original conception,
which is an essential problem for the information retrieval
system.

For the reason as mentioned above, it is desirable to
provide a method which will retrieve contents with reference
directly to the text of a document. According to the method
for referring to the text, a retrieval can be practiced by means
of a vocabulary recognized as a conception which was not
deemed to be important when the document was registered
in a data base but is taken as new at the point of time of
retrieval. Or otherwise, an important document can be
searched out directly without passing through the "filter" of
an indexer (a specialist who assigns index terms) at the time
of registration.

To satisfy such a requirement, it is necessary that a
character pattern is extracted from the document as image
data and the text is replaced by a character code, and a
character recognition technique may be applied therefor.
However, for a printed document, for example, which is an
object for filing, character recognition is not perfect
because of the diversity of print quality and fonts. In a
conventional optical character reader, imperfect recognitions
such as error, rejection and the like are subjected to checks
and corrections by operators. (For example, "Introduction to
Character Recognition" by Hashimoto, Ohm-Sha, 1982,
pp. 153-154.) Accordingly, even if the recognition precision
is extremely high, a method for checking visually a result
obtained through recognizing the text is not realistic where
the amount of documents is very large, and hence a document
filing system with images as the main constituents which is
available for text retrieval has not been realized until now.

SUMMARY OF THE INVENTION 

An object of the invention is to provide a document 
storage and retrieval system having a full text retrieval 
function with reference directly to the text of a document by 
solving the problems referred to above. 

In order to attain the above-mentioned object, the inven- 
tion stores and retrieves both the document image data and 
full-text data, where full-text data is used to support the 
full-text search capability, and the image data is used to 
present or display the contents of the retrieved documents to 
the retriever. This system inputs character strings at the 
retriever's request, and searches for these strings in character 
strings in the full-text data. By searching for the correspond- 
ing image file identifiers in the image file directory, the 
locations of the corresponding document image data are 
identified and the retrieved document images are displayed 
onto the document retrieval terminal. 

This system further recognizes the contents of the
documents from the image data and stores the resulting text data
to support the full-text retrieval capability. To overcome the
problem of insufficient character recognition accuracy, the
character recognition module of this system outputs multiple
candidate character codes when more than one character has
a very high similarity value, thereby avoiding
misrecognition. The full-text data so created therefore
includes some ambiguity. For example, an ambiguous text is
represented as ". . . S[mw][il]th . . .", where [mw]
represents two characters "m" and "w" which are the
candidates for the recognized character, and [il] represents "i"
and "l" which are the candidates having the most similarity.
The full-text search mechanism of this system can identify
that a substring "Smith" is included in the documents
with high accuracy even from the full text of document
recognition results.
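
The matching of a search key against an ambiguous recognition
result can be sketched as follows. This is a minimal
illustration in Python, not the flexible string matching circuit
of the embodiments; the function names parse_ambiguous and
contains are hypothetical, and the bracket notation is assumed
to list single-character candidates only.

```python
import re


def parse_ambiguous(text):
    """Split a recognition result into one candidate set per character position.

    'S[mw][il]th' becomes [{'S'}, {'m', 'w'}, {'i', 'l'}, {'t'}, {'h'}].
    """
    positions = []
    # Either a bracketed candidate group or a single unambiguous character.
    for group, single in re.findall(r"\[([^\]]+)\]|(.)", text, flags=re.S):
        positions.append(set(group) if group else {single})
    return positions


def contains(text, key, ignore_case=True):
    """Return True if key can occur in text under some choice of candidates."""
    positions = parse_ambiguous(text)
    if ignore_case:
        positions = [{c.lower() for c in p} for p in positions]
        key = key.lower()
    n, m = len(positions), len(key)
    # Naive scan: try every start position and compare character by character.
    return any(
        all(key[j] in positions[i + j] for j in range(m))
        for i in range(n - m + 1)
    )


if __name__ == "__main__":
    print(contains("Peter S[mw][il]th", "Smith"))
    print(contains("Peter S[mw][il]th", "Smyth"))
```

The naive scan is quadratic in the worst case; it is shown only
to make the candidate-set idea concrete.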

As shown in FIG. 1, a document 10 is transformed into a 
predetermined special character notational expression as 
indicated by 20 in the system according to the present 
invention. The symbol string used is that provided in lan- 
guages such as LISP. It follows a notation called 
S-expression. A process in which the document (image) is 
transformed into a notational expression 20 is called docu- 
ment understanding or document recognition. The notational 
expression signifies roughly the following. That is, the
document is numbered 99, the class is "Technical Paper",
VOL=5, NO=7, the author is named "Peter S[mw][il]th",
the title is "Fu[ll][ll]∧Text∧[RB]etr[il]e . . .", the text is
". . . Fu[ll][ll]∧Text∧se[ao]rch . . .", and so forth. Here, ∧
indicates a blank (space) and so forth.

In the character recognition, that of ambiguity includes, in
most cases, a character pattern which can hardly be coped
with normally.

For retrieval, meanwhile, a user inputs
"FULL∧TEXT∧RETRIEVAL" from a keyboard. Generally, there are
such languages as will express the same meaning in different
words, and in this case "FULL∧TEXT∧SEARCH" has also
the same meaning. While handling such ambiguity
automatically, the system is capable of searching documents
having the same character string.

A plurality of partial character strings to be found out of
the sentence to be retrieved are expressed by a finite state
automaton as shown in FIG. 2. The title character string
which is one of the sentences to be retrieved as exemplified
in FIG. 1 can be expressed similarly by the automaton of
FIG. 3. In this case, however, there is no distinction between
a capital letter and a small letter. The invention provides a
text search (character string retrieval) function in case there
is present an ambiguity (a plurality of possibilities, or the
state wherein elements which cannot be decided identically
are present) on both searching key (partial character string)
and sentence to be retrieved, which is a third principle.

A method given in a report [by A. V. Aho, et al., "Efficient
String Matching: An Aid to Bibliographic Search,"
Communications of the ACM, Vol. 18, No. 6, 1975] is well
known for searching a plurality of partial character strings
out of an unambiguous text by the finite state automaton.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a drawing showing a document image and a
result of document understanding;

FIG. 2 is a state transition diagram of a synonymic
character string generated from a partial character string;

FIG. 3 is a state transition diagram of a character string as
a result of character recognition which includes ambiguity;

FIG. 4 is a system configuration drawing of a first
embodiment;

FIG. 5 is a table of the main directory keeping the
bibliographic data;

FIG. 6 shows tables for storing location information of
text data and image data;

FIG. 7 is a table storing publication information;

FIG. 8 [...];

FIG. 9 [...] a relationship with the body file;

FIG. 10 is a block diagram of a document recognizer;

FIG. 11 is an explanatory drawing of a rectangular area
surrounding a character pattern;

FIG. 12 is a drawing illustrating a contour expression
method for describing a pattern;

FIG. 13 is a drawing illustrating a [...]
components and character pattern;

FIG. 14 and FIG. 15 are drawings showing a result of
segmenting rows and columns respectively by means of a
bottom-up segmenter;

FIG. 16 is an explanatory drawing of an algorithm for
obtaining a state transition list from a character string [...];

FIG. 17 is a block diagram of a flexible string matching
circuit;

FIG. 18 is an extended finite state automaton permitting
an ambiguous character string;

FIG. 19 is a transition table of the extended finite
state automaton;

FIG. 20 is a drawing illustrating a program of FSM
circuit;

FIG. 21 is a configuration drawing of a flexible string
matching circuit in a second embodiment.

DESCRIPTION OF THE PREFERRED
EMBODIMENTS

The invention will now be described with reference to
illustrative examples. FIG. 4 is a configuration drawing of a
document storage and retrieval system forming one
embodiment of the invention. The system comprises a control
subsystem 100 providing a general control and a data base
function, an input subsystem 200 for inputting a document
and others and registering in a file, a document recognizer
300 for recognizing documents, a text search subsystem 400
for carrying out a high-speed text search, and a terminal
subsystem 800 for carrying out a retrieval.

A configuration and a flow of operation of each subsystem
will be described in detail below.

The input subsystem 200 has a central processing unit
(CPU) 201 for controlling the subsystem, a main memory
202, a system file 251 and a terminal 203 as a basic division.
The subsystem is controlled by operation from the terminal
203. An image on each page of a document 220 is read
optically by a scanner 221, and digitized image data is stored
first in a video memory 224 by way of a bus 210. The image
data is then subjected to a redundant compression on an
[...] MH (Modified Huffman) code or MR (Modified Read)
code and then returned to another area of the video memory
224.

The inputted document image is displayed on the terminal
203 for confirmation, and the operator is capable of inputting
bibliographical items such as the title, author's name,
creation date and others while observing the image displayed
thereon. As will be described hereinlater, bibliographical
items of a formatted document can be read automatically
through document understanding; however, bibliographical
items of a not-formatted document and items of information
which are not entered in paper must be inputted manually.
For example, it is natural that a classification code of
document contents defined by users and a keyword which is
[...] on paper should be inputted by the operator.
[...] can be inputted from the terminal 203. A data of such
bibliographical items and others inputted as above is
correlated with an image data (compressed data) in the video
memory 224 and is then loaded in the main memory 202.
Here each document is given a proper number (document
ID) and stored in the memory so as to draw image data and
bibliographical items using the proper number of the
document as a key. The document proper number can be
expressed, for example, by coupling an identifier number
('INSYS01' and the like) of the subsystem to the character
string indicating date and time. For example,
INSYS01.850501.132437 indicates a document inputted from an
input subsystem INSYS01 at 13 h: 24 m: 37 s on May 1, 1985.
There may be a case where the input time is important
according to application of the system, and hence it
functions as a time stamp otherwise.
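
The document proper number format described above (a subsystem
identifier joined to a date and time string) can be sketched as
follows. This is only an illustration in Python; the function
names are hypothetical, and the two-digit year is interpreted by
Python's %y convention, under which 85 is read as 1985.

```python
from datetime import datetime


def make_document_id(subsystem, stamp):
    """Join a subsystem identifier with a YYMMDD.HHMMSS time stamp."""
    return f"{subsystem}.{stamp:%y%m%d}.{stamp:%H%M%S}"


def parse_document_id(doc_id):
    """Split a document proper number into (subsystem, datetime)."""
    subsystem, date_part, time_part = doc_id.split(".")
    stamp = datetime.strptime(date_part + time_part, "%y%m%d%H%M%S")
    return subsystem, stamp


if __name__ == "__main__":
    subsystem, stamp = parse_document_id("INSYS01.850501.132437")
    print(subsystem, stamp.isoformat())
```

Because the time stamp is part of the key, two documents entered
on the same subsystem within the same second would need an
additional discriminator; the text does not say how that case is
handled.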
Now, whenever a predetermined quantity of the document
is accumulated in the subsystem 200 or a predetermined
command arrives from the terminal 203, an interrupt signal
is sent to a bus adapter 171.

A control subsystem 100, sensing the interrupt signal,
reads a predetermined address in the memory 202 of the
input subsystem 200. The contents of a request of the input
subsystem can thus be decided.

An operation follows as described below upon request of
a registration of the inputted document in a data base.

The central processing unit (CPU) 101 is acquainted with
the proper number of documents stored temporarily in the
input subsystem 200 according to a predetermined program
in a main memory 102 and further with a memory address
of bibliographical data (bibliographical items) relating
thereto and image data.

The control subsystem 100 has a data base file 151 for
storing and managing symbolic data such as bibliographical
data and the like, and an image file 152 for storing and
managing the image data.

The bibliographical data read out of the input subsystem
200 is written as a new record in a data base (loaded in the
file 151) which is given in the form of the table in FIG. 5.
The table of FIG. 5 is named MAIN-DIR (main directory)
and has the following data columns.

DOC#: A serial number of document registered in the
system.
ID: A document proper number given by the input
subsystem.
NP: A page number constituting the document.
TITLE: A title (character string).
AUTHOR: An author's name (permitting iteration of
plural data).
CLASS: A symbol indicating classification, kind and the
like of documents.
PUBL#: A number of publication registered in the system
(detail being managed on the table shown in FIG. 7).
VOL, NO, PP: Volume, number, page.
KWD: A plurality of keywords.
ABS: A text proper number of abstract expressed as a
character code string (text data).
TXT: A text proper number as a character code string.
IMG: A proper number of image data. Since the image data
is managed at every page, a plurality of image proper
numbers are recorded.

In registration of the bibliographical data, only such data
of the above columns as will relate partly to the
bibliographical data is written newly.

Next, the image on a page constituting each document is
read to the control subsystem 100 from a predetermined
storage area of the input subsystem and is then stored
sequentially in an empty area of the image file 152. Each
image (page unit) is concurrently given an image proper
number (IMGID). Then, a volume number (VOLSER) of the
file having loaded the image data therein, a file unit number
(UNIT), a loading physical address (PHYSA) in the file, a
record length (SLENG) in the file and others are written in
tables shown in FIG. 6(a) and FIG. 8. The image proper
number IMGID given newly is also recorded in IMG column
of the table MAIN-DIR (FIG. 5).

Here, a table IMG-LOC shown in FIG. 6(b) is particularly
effective when the image file 152 is constituted of a plurality
of driving devices or a plurality of volumes, managing the
location of each image. As a matter of course, it is updated
at every operation for demounting and mounting the
volume by operators.

Then, FIG. 8 shows a directory provided at each volume
of the image file 152, and the following columns are
provided therein.

IMGID: An image proper number,
PN: A serial page number (1 to n) in a document,
PHYSA: A physical address in a volume,
SLENG: A record length (sector number, for example),
[...]: An [...] code name,
[...]
DOC#: A document serial number.

Then in the drawing, data in the column PHYSA of a record
157 indicates a leading address of image data 158 in an
image data area 156 in the image file.

Now, whenever the above operations come to an end, the
[...] is ready for retrieving bibliographical items and
[...]. A retrieval condition inputted from the retrieving
terminal [...] is transmitted to the CPU 101 of the control
subsystem 100 by way of a gateway 175. A retrieval of a table
MAIN-DIR (FIG. 5) in the data base file 151 is carried out
according to a predetermined retrieving program of the memory
102. It goes without saying that indexing (for high-speed
retrieval such as hashing, inverted file and the like) is
applied to main columns of the table MAIN-DIR.

As a result of retrieving, a list of DOC# from the table
MAIN-DIR (FIG. 5) and a list of image proper number
IMGID are made out and stored in a predetermined area of
the memory 102. Upon request for display from the
retrieving terminal, a position in the image file is identified
by means of a table IMG-LOC 154 (FIG. 6(b)) and a table
IMG-DIR 155 (FIG. 8), and the image data is read
successively onto the memory 102. The image data thus read out
is transmitted to the retrieving terminal in turn and then
displayed on a screen according to an indication on the
terminal.

A managing method for the text used for full text retrieval
will be described next.

As described in the main directory MAIN-DIR (FIG. 5),
[...] managed not only for image
data but also for text expressed in a character code string. In
the example, the abstract and the text are stored and
managed in text files 451, 452, 453 as a text. Each text
(character string) is given a proper text number and recorded
in columns ABS and TXT of the table MAIN-DIR (FIG. 5), a
column TXTID of the table TXT-LOC shown in FIG. 6(a),
and a column TXTID of the table TEXT-DIR shown in FIG.
9.

FIG. 9 indicates a method for storing and managing texts
in the text files 451, 452, 453. In the drawing, a text body is
stored one-dimensionally in a file storage area 466. Each text
(one character string) is given a proper number TXTID and
managed in a directory table TEXT-DIR 465.

TXTID: A text proper number,
NCH: A total number of characters constituting the text,
PHYSA: A physical address in which the text is recorded,
SLENG: A record length on a storage medium of the text,
CCLASS: A class of characters expressing the text
(Chinese character-mixed Japanese statement, English
statement, Roman character, kana character and
others).

A record 467 of the table 465 indicates that the text
expressed by the record is a portion 468 in the storage area
in the file.
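
The way a TEXT-DIR record points at a portion of the
one-dimensional storage area can be sketched as follows. This is
a minimal illustration in Python under stated assumptions:
PHYSA is treated as a byte offset and SLENG as a length in bytes
(the text gives a sector count only as one example), the text is
assumed to be ASCII, and the names TextDirRecord and read_text
are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TextDirRecord:
    txtid: int    # TXTID: text proper number
    nch: int      # NCH: number of characters in the text
    physa: int    # PHYSA: start of the text (assumed: byte offset)
    sleng: int    # SLENG: record length (assumed: bytes)
    cclass: str   # CCLASS: class of characters


def read_text(storage, directory, txtid):
    """Return the text whose directory record has the given TXTID."""
    rec = directory[txtid]
    raw = storage[rec.physa:rec.physa + rec.sleng]
    # The record may be padded to its storage length; keep NCH characters.
    return raw.decode("ascii")[:rec.nch]


if __name__ == "__main__":
    storage = b"FULL TEXT RETRIEVAL" + b"some other text"
    directory = {
        1: TextDirRecord(1, 19, 0, 19, "English"),
        2: TextDirRecord(2, 15, 19, 15, "English"),
    }
    print(read_text(storage, directory, 2))
```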
On the other hand, as shown in FIG. 4, the text can be
recorded in a plurality of volumes, and the text directory is
that of managing the text in each volume. When the plural
volumes are mounted, it is necessary to know in which of the
volumes a text is present, and the table TXT-LOC
shown in FIG. 6(a) manages the location of each text. A
volume serial number VOLSER in which the text having the
text proper number TXTID is recorded, and a file unit
number UNIT in which the volume is mounted, are managed.
TXT-LOC will be updated automatically as a matter of
course when a physical volume is demounted or newly
mounted by operators.

Then, when input of document images, input of
bibliographic items and registration of documents, which
constitute a large operation, are completed, a text recognition
(document understanding) of the registered document is carried
out by the document recognition apparatus 300. An input of the
recognition apparatus is the document image 10 shown in
FIG. 1 in an image file 152, and a recognition result output
is a notational expression 20 shown likewise in the drawing.
A text portion of the abstract and the text in the notational
expression 20 is stored newly and so managed by the text
files 451 to 453 as described hereinabove.

The document recognition will be described with refer- 
ence to a detailed block diagram of the document recogni- 
tion apparatus shown in FIG. 10. 

The recognition apparatus 300 is connected to a bus 110 
of the control subsystem 100 through a bus adapter 371 and 
controlled by CPU 301. A memory 302 stores data of a 
program and a parameter for controlling operation of the 
apparatus. 

An image data to be recognized is transmitted from the
image file 152 to a memory 321. The image data, which has
been coded through compression, is decoded to a bit expression
image by an image processing circuit IP 322 and is again
stored in the memory 321. Then consecutively, a contour
extraction of the pattern is carried out by the IP 322 from the
image decoded to a bit expression, and a result of extraction
is again loaded in the memory 321.

The extracted contour data is expressed as follows:

$(i,\ C_i,\ x_{\min},\ x_{\max},\ y_{\min},\ y_{\max},\ x_s,\ y_s,\ (\theta_1\ L_1) \ldots (\theta_n\ L_n))$ (1)

where $i$ represents a contour proper number (1, 2, 3, ...),
and $C_i$ represents a class of the contour. Then, $C_i = 0$
represents an outer contour (a full line 1001 in FIG. 11), and
$C_i = 1$ represents an inner contour (a broken line 1002 in
FIG. 11). Those $x_{\min}$, $x_{\max}$, $y_{\min}$, $y_{\max}$
represent the coordinates of the vertices of an outer
quadrangle of the contour, as shown in FIG. 11. Further,
$(x_s, y_s)$ is a coordinate of one point Ps on the contour
(for example, the point found first by contour retrieval).
With the point Ps as an origin, as shown in FIG. 12, the
contour data itself is expressed by rows of sets of a
quantized direction code $\theta$ and a pixel
number L with the same direction continuing therefor.
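
How a start point and a row of (direction code, pixel number)
sets describe a contour can be sketched as follows. This is an
illustration in Python only. The text does not state how many
directions the code is quantized into; eight directions at
45-degree steps are assumed here, and the names STEPS,
trace_contour and bounding_box are hypothetical.

```python
# Assumed quantization: code k points along k * 45 degrees, counter-clockwise,
# with y growing downward as in raster images.
STEPS = [(1, 0), (1, -1), (0, -1), (-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1)]


def trace_contour(start, runs):
    """Expand (direction code, run length) pairs into contour points."""
    x, y = start
    points = [(x, y)]
    for theta, length in runs:
        dx, dy = STEPS[theta]
        for _ in range(length):
            x, y = x + dx, y + dy
            points.append((x, y))
    return points


def bounding_box(points):
    """Return (x_min, x_max, y_min, y_max) of the outer quadrangle."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return min(xs), max(xs), min(ys), max(ys)


if __name__ == "__main__":
    # A 3 x 2 rectangle traced right, down, left, up from the origin.
    pts = trace_contour((0, 0), [(0, 3), (6, 2), (4, 3), (2, 2)])
    print(pts[-1], bounding_box(pts))
```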

Next, an inclination correction circuit 323 detects a tilt
angle arising at the time of document input from the contour
data given by the expression (1), corrects the contour data
accordingly and then rewrites it to the memory 321. For
example, a system disclosed by the inventor in Japanese
Patent Application No. 152210/1985 may be employed for
the inclination correction algorithm.

From a portion of the contour data corrected for
inclination $(x_{\min},\ x_{\max},\ y_{\min},\ y_{\max})$, a row
segmentation and a column segmentation are carried out on a
bottom-up segmenter (BSG) 324.

The bottom-up segmenter BSG inputs the data expressed
in the form of expression (1), generates a pattern list given
by the expression (2) and loads it in the memory 321.

$(j,\ x_{\min},\ x_{\max},\ y_{\min},\ y_{\max})$ (2)

Here, j represents a pattern proper number, the pattern is
defined as a rectangular area not overlapping mutually, and
the expression (2) further defines vertex coordinates of the
rectangular area. For example, rectangular areas 1008, 1009
indicated by broken lines in FIG. 13 are inputs of the BSG,
whereas a rectangle 1010 is obtained through the BSG.
The rectangles 1008, 1009 are each made of one contour and
are elements, and the rectangle 1010 is a pattern forming
one character. An element constituting the pattern j is
obtainable through searching the contour data of the
expression (1) for the rectangles included in the
rectangular area defined by the expression (2). It can be
obtained separately and loaded as data. A result of row
segmentation and another result of column segmentation are
shown diagrammatically in FIG. 14 and FIG. 15 respectively.
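
The search for the elements of a pattern, that is, the contour
rectangles included in the pattern's rectangular area, can be
sketched as follows. This is an illustration in Python only; the
coordinates are invented for the example, the reference numerals
1008, 1009 and 1010 are reused merely as dictionary keys, and the
function names are hypothetical. Rectangles are written as
(x_min, x_max, y_min, y_max).

```python
def contains_rect(outer, inner):
    """Return True if the inner rectangle lies entirely within the outer one."""
    ox0, ox1, oy0, oy1 = outer
    ix0, ix1, iy0, iy1 = inner
    return ox0 <= ix0 and ix1 <= ox1 and oy0 <= iy0 and iy1 <= oy1


def elements_of_pattern(pattern_rect, contour_rects):
    """List the contour numbers whose rectangles fall inside the pattern."""
    return [cid for cid, rect in contour_rects.items()
            if contains_rect(pattern_rect, rect)]


def enclosing_rect(rects):
    """Smallest rectangle covering all the given rectangles."""
    return (min(r[0] for r in rects), max(r[1] for r in rects),
            min(r[2] for r in rects), max(r[3] for r in rects))


if __name__ == "__main__":
    # Hypothetical contours: the dot and the stem of a letter "i",
    # plus an unrelated contour further to the right.
    contours = {1008: (10, 12, 0, 2), 1009: (10, 12, 4, 12), 2000: (20, 26, 4, 12)}
    pattern_1010 = enclosing_rect([contours[1008], contours[1009]])
    print(pattern_1010, elements_of_pattern(pattern_1010, contours))
```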

A character segmentation division (CSG) 325 extracts the
pattern constituting a character from the above pattern list
with reference to a document knowledge arranging
regulations such as document form and the like. As shown in FIG.
10, the document knowledge is loaded in a document
knowledge file (DKF) 327.

Structural regulations of the layout of such as a title,
author's name, author's belonging, abstract, text and the like
are stored according to each kind of documents in the
document knowledge file together with a parametric
knowledge such as the size of font. The knowledge is described in
a format description language. The language disclosed in
Japanese Patent Application No. 122424/1985 may be used
as a format description language.

The character segmentation division CSG operates for
integration of a pattern constituting one character which has
been divided into two patterns or more or, to the contrary, for
compulsory separation of two or more characters which have
been fused through contact into one pattern.

The character segmentation division CSG outputs the
number of the patterns constituting each character in a list
for each item such as the title, abstract or text as the result
of processing. For example:

$(\text{ABSTRACT}\ \text{"}j_1 \ldots [j_n\ j_{n+1}\ j_{n+2}] \ldots j_m\text{"})$ (3)

represents that the abstract is constituted of a string of
characters expressed by a pattern number j. Here,
$[j_n\ j_{n+1}\ j_{n+2}]$ represents that the character is a
combination of three patterns $j_n$, $j_{n+1}$, $j_{n+2}$.

A character recognition division (CRG) 331 extracts the
contour data constituting each character pattern, as
described hereinabove, from the above-mentioned pattern
list (expression (3), for example) and the contour data (given
by expression (1)) on the memory 321, and transforms it into
a data structure ready for feature extraction.

Since a known art may be employed as the character
recognition technique, a detailed description will be omitted
here; however, after a feature is extracted from the contour
data, each character can be recognized through a pattern
matching with the standard pattern in a standard pattern file
333. In FIG. 10, a memory STPM 334 is one for storing a
standard pattern with high reference frequency, aiming at a
high-speed processing.

The result of the character recognition is output, as
described hereinabove, by the notational expression 20
shown in FIG. 1. In the process of final decision in the
character recognition, when a similarity obtained as a result
of pattern matching satisfies an expression (4), a character
category (character code) $\omega^*$ for giving the similarity
is output:
$p^* \ge p_i,\quad \min(p^* - p_i) \ge \epsilon \quad \text{for } i = 1, 2, \ldots, K$ (4)

where $p^*$ is a similarity to the character category
$\omega^*$, $K$ is a total category number, and $\epsilon$ is a
relative threshold.

If the expression (4) is not satisfied, then an aggregation
of the character category $\{\omega_k \mid k = k_1, k_2, \ldots\}$
satisfying an expression (5) is output within two special
character codes. For example, a character (code) string
$\omega_s\ \omega_{k_1}\ \omega_{k_2} \ldots \omega_e$ is
output. Here, $\omega_s$ represents "[", and $\omega_e$
represents "]".

$p^* \ge p_i \quad \text{for } i = 1, 2, \ldots, K;\qquad p^* - p_k \le \epsilon,\quad k \in \{1, 2, 3, \ldots, K\}$ (5)

In case a similar character is present and the expression
(4) is not satisfied by the above processing, a recognition
result "FU[Ll][Ll]∧TEXT∧SEA[RB]CH" is obtainable,
for example, in response to the input pattern
"FULL∧TEXT∧SEARCH". The recognition result is
buffered on the memory 321 and then transmitted to the memory
102 (FIG. 4) collectively.

In the control subsystem 100, a maximum text proper
number is detected with reference to the table TXT-LOC
(FIG. 6), and a character code string (text) of the recognition
result is registered with a value added by 1 as a new text
proper number. The registration is carried out with respect to
the main directory 153, the table TXT-LOC and the table
465 (FIG. 9), and the text data itself is loaded in any of the
text files 451 to 453.

Now, the document to which a text data is given as above
is ready for retrieving using the text search subsystem 400.

Next, the text search subsystem 400 for retrieving text
contents and its operation will be described in detail.

A request for text content retrieval, ABS=
"TEXT∧RETRIEVAL", for example, which is so made
from the terminal 800 is transmitted first to the control
subsystem 100. In the subsystem 100, where the document
to be retrieved has already been narrowed down through
keyword retrieval or other means, a proper number of the
text incidental to the document is selected from the main
directory MAIN-DIR 153, and an expression (6) for the list
of proper numbers of the texts to be retrieved is made out
according to each text file with further reference to the table
TXT-LOC.

$(u_i\ v_i\ t_1\ t_2 \ldots t_k \ldots),\quad i = 1, 2, \ldots, M$ (6)

where $u_i$ is an i-th file unit number, $v_i$ is a volume serial
number, $t_k$ is a k-th text proper number to be retrieved on the
volume. Then, $M$ is a maximum number of the text file unit.

On the other hand, when the document to be retrieved has
not been narrowed, a special symbol (expression (7), for
example) is sent to the whole text file.

[...] $i = 1, 2, \ldots, M$ (7)

The expression (6) or (7) and the partial character string
("TEXT∧RETRIEVAL", for example) are transmitted to a
memory 402 of the text search subsystem 400 by way of a
bus adapter 172.

In the subsystem 400 (FIG. 4), a hetero-notation
generation processing and a synonym processing of the transmitted
partial character string are carried out according to a
pre[...] variation in form or spelling, such as "center" and
"centre" or "DATA" and "data". A hetero-notation generation
convention and a thesaurus are stored in a file 403.
"TEXT SEARCH" will further be obtainable through
referring to the thesaurus. Further, with reference to the
hetero-notation generation, a method disclosed in Japanese
Patent Application No. 150176/1985 may also be employed.

As the result of the above-mentioned processing, an
aggregation of character strings ("TEXT∧RETRIEVAL",
"TEXT∧SEARCH") is obtainable, after all, to
"TEXT∧RETRIEVAL". This is indicated by an expression
(8).

$(A_1 \ldots A_i \ldots A_n) = (\text{"}a_{11}\ a_{12} \ldots a_{1 m_1}\text{"} \ldots \text{"}a_{i1}\ a_{i2} \ldots a_{i m_i}\text{"} \ldots \text{"}a_{n1} \ldots a_{n m_n}\text{"})$ (8)

where $n$ is a number of character strings, $m_i$ is the length of
an i-th character string, $a_{ij}$ is a character code j-th from
the lead of an i-th character string $A_i$.

The subsystem 400 further transforms the expression (8)
representing the character string aggregation into a state
transition list (9) representing the finite automaton illustrated
in FIG. 2 according to a predetermined program.

$\delta\ \text{list} = ((S_{j_1}\ C_{i_1}\ S_{k_1})\ (S_{j_2}\ C_{i_2}\ S_{k_2}) \ldots (S_{j_m}\ C_{i_m}\ S_{k_m}))$ (9)

wherein each element of the $\delta$ list (9) indicates that
when the character $C_i$ is inputted (or coincides therewith)
in the state $S_j$, it is transmitted to the state $S_k$. Then
in the expression, those which are equal to each other are
included in [...].

Further, an output list (10) expression is generated.

$o\ \text{list} = ((S_{i_1}\ A_{k_1})\ (S_{i_2}\ A_{k_2}) \ldots)$ (10)

where $(S_i, A_k)$ implies that the character string $A_k$ is
found at the point of time when reaching the state $S_i$.

FIG. 16 shows a PAD (program analysis diagram) of the
algorithm for deriving the state transition list (9) and the
output list (10) from the character string aggregation (8)
expression.

A failure transition list (11) expression is obtained
from the state transition list (9).

$f\ \text{list} = (\ldots (S_i\ S_j))$ (11)

aeterroined program in the memory 402 the term "hetero- The element (S OT S^) of f list specifies transition of the 

notation" refers to words having the same meaning but a character C k inputted in the state S m to the state S /m with 
reference to f list when the state to be transmitted is not
specified in a list (9) expression. It may be called generally
a failure function.

The f list is provided so as to cope with the case where a
reinitialization of the state to S_0 is generally not correct
when a matching is successful halfway of a character string
but the next character does not coincide in the partial
character string matching, i.e., a destination of the
predetermined state transition is not found. For example, a
retrieval of two partial character strings
"CHARACTERΔRECOGNITION" and
"OPTICALΔCHARACTERΔREADER" is assumed. Supposing a sentence
reading ". . . OPTICALΔCHARACTERΔRECOGNITION . . ." is
inputted, a portion up to "OPTICALΔCHARACTERΔRE"
coincides with the second partial character string but the
next character "C" is not for matching. Here, if the state is
returned to S_0 for resetting, the automaton processes the
ensuing sentence "COGNITION . . ." as input characters,
therefore the partial character string
"CHARACTERΔRECOGNITION" will be overlooked
after all. Accordingly, the state to be transmitted in the case
of failure matching is not S_0, but the state must stand as
matching a transition pass "CHARACTERΔRE" of the first
character string "CHARACTERΔRECOGNITION".

Then next the subsystem 400 transmits the state transition
list a list, the output list o list and the failure transition
list f list made out as described above to lower flexible
string matching circuits FSMs 501 to 503.

A further detailed block diagram of the flexible string
matching circuit 501 is shown in FIG. 17. (The block
diagram applies likewise to FSMs 502, 503.)

The above-described three lists, a list, o list and f list, are
loaded in predetermined areas of a memory 513 by way of
an adapter 571. A microprocessor 511 generates an extended
finite automaton shown in FIG. 18(b) in the form of a state
transition matrix on the above information according to a
predetermined microprogram.

The extended finite automaton that the lists, a list and f
list, directly imply has a simple form as shown in FIG. 18(a).
The drawing illustrates two transitions

(S_j C_k1 S_p)
(S_j C_k2 S_q) (12)

in the a list.

The microprocessor 511 extends and transforms the
extended finite automaton shown in FIG. 18(a) to the one as
shown in FIG. 18(b). The transformation is determined
identically. A predetermined partial character string can be
searched from the ambiguous text to be retrieved according
to the transformation. Here, in the drawing, f(S_j) is a failure
function made out of the failure transition list f list,
indicating a state of the destination of transition when failing in
matching at the state S_j. Then, the state W_j corresponds
one-to-one to the state S_j, scanning the ambiguous character
string (taken within symbols [ to ]). Further, the states T_j1, T_j2
are states coming out of the state W_j correspondingly to a
transition from the state S_j, indicating that the character
being retrieved (C_k1 or C_k2 in the drawing) has been found
in the ambiguous character string.

Practically, the microprocessor 511 is capable of generating
the state transition table shown in FIG. 19(a) directly
from the two lists, a list and f list. A column (vertical) in the
state transition table indicates a current state, and a row
(lateral) corresponds to a character (code) inputted under the
state. The state to transit next is written in the table. Since
the algorithm for generating the state transition table will be
analogized easily from illustration according to FIG. 18, a
further description is omitted.

The microprocessor 511 further transforms the output list,
o list, into the form of an output table shown in FIG. 19(b)
and records it in a predetermined area of the memory 513
together with the state transition table.

A string search algorithm using the finite state automaton
is given as below.

[String Search Algorithm]

    T := 'false';
    S := S_0;
    while not eof do
    begin
        read(c);
        S := next(c, S);
        if out(S) <> nil
            then T := 'true';
    end;

Here, the function next(c, S) is one for obtaining the next
state from the state transition table shown in FIG. 19(a) on
the character c and the current state S. Further, the function
out(S) is one for deciding whether or not an output is present
on the state S with reference to the output table shown in
FIG. 19(b).

Then, the state is assigned to a unit of one character code
in the above description, however, in case the one character
code is 2 bytes like Japanese, it is divided into 1 byte each
and then the above-described method can be applied thereto.

Next, the text search subsystem 400 accepts the lists (6)
expression and (7) expression of the proper numbers of texts
to be retrieved, and transmits them to the corresponding
FSM as text proper number lists to be retrieved at each FSM.
Accordingly, if there exists an object to search in the
corresponding text file, each FSM obtains the proper number
list (t_1 t_2 . . . t_k . . .). The text proper number list is loaded
in the memory 513 (FIG. 17). The microprocessor MPU 511
detects a physical address of each text according to a
predetermined program (FIG. 20) in a microprogram
memory 512. The text proper number and the physical
address are managed by TEXT-DIR illustrated in FIG. 9, and
the table can be read out of the file 451 and thus detected.

The microprocessor 511 then reads each text data out of
the file 451. A file control division 531 inputs text data
(character string) thus read out successively to an FIFO
(first-in-first-out) circuit 532. The microprocessor MPU 511
reads characters one by one out of FIFO 532 and verifies
whether or not a predetermined partial character string is
present according to the finite automaton (FIG. 18(b))
defined in the memory 513. A string matching result list
(FIG. 20) is returned to the memory 402 of the upper
processor.

CPU 401 arranges text proper number lists with retrieval
conditions coincident with each other which are sent back
from a plurality of lower FSMs into one according to a
predetermined program and transmits them further to the
memory 102 in the upper control subsystem. A document
proper number DOC# with partial character strings matched
therefor and a proper number of a document image IMGID
or a title TITLE can be identified from the text proper
number by referring to the main directory 153 (FIG. 5).

The retrieval results are sent back to the terminal 800.
Users are capable of calling the image of a desired document
to a CRT to display thereon while observing the title and
others on the CRT.
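
The string search of the first example can be illustrated with a minimal sketch, given here in Python as an assumption of this illustration and not as the microprogram of the FSM. It builds a state transition table, a failure function and an output table from the partial character strings in the manner of the well-known Aho-Corasick method, and then runs the loop of the string search algorithm. The names build_automaton, next_state and search are hypothetical, and the handling of ambiguous character strings (the states W and T of FIG. 18(b)) is not modeled.

```python
from collections import deque

def build_automaton(patterns):
    """Build transition, failure and output tables (hypothetical helper)."""
    goto = [{}]        # goto[state][char] -> next state (state transition list)
    output = [set()]   # output[state] -> strings found on reaching the state
    for p in patterns:
        s = 0
        for ch in p:
            if ch not in goto[s]:
                goto.append({})
                output.append(set())
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        output[s].add(p)
    failure = [0] * len(goto)            # failure transition list
    queue = deque(goto[0].values())      # depth-1 states fail to state 0
    while queue:
        s = queue.popleft()
        for ch, t in goto[s].items():
            queue.append(t)
            f = failure[s]
            while f and ch not in goto[f]:
                f = failure[f]
            failure[t] = goto[f].get(ch, 0)
            output[t] |= output[failure[t]]
    return goto, failure, output

def next_state(goto, failure, ch, s):
    """Follow failure transitions until a transition on ch exists."""
    while s and ch not in goto[s]:
        s = failure[s]
    return goto[s].get(ch, 0)

def search(text, patterns):
    """Return the set of patterns present in text."""
    goto, failure, output = build_automaton(patterns)
    found, s = set(), 0
    for ch in text:                      # read(c)
        s = next_state(goto, failure, ch, s)
        found |= output[s]               # out(S) <> nil
    return found

patterns = ["CHARACTER RECOGNITION", "OPTICAL CHARACTER READER"]
print(search("... OPTICAL CHARACTER RECOGNITION ...", patterns))
```

With the sentence taken up above, the failure function carries the state from the path "OPTICAL CHARACTER RE" to the path "CHARACTER RE" instead of resetting to the initial state, so the first partial character string is not overlooked. A plain space stands in for the blank symbol in this sketch.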
A second illustrative example will be described next. In
the example, only the configuration of the flexible string
matching circuit 501 is different. FIG. 21 is a configuration
drawing of the flexible string matching circuit FSM in the
second example.

In the drawing, a secondary storage unit (text file) 461 has 
a plurality of heads capable of reading a signal 
simultaneously, and in the example, data can be read out of 
the four heads simultaneously. The data is transmitted to 
four FIFO circuits 551 to 554 each by way of a file control 
unit FCU 541. 

On the other hand, retrieval conditions sent from the
upper subsystem 400 are interpreted by the microprocessor
511 and then transmitted to microprocessor units MPU 1 561
to MPU 4 564 including data memories.

Text data read out of the text file 461 are read to the
microprocessor units 561 to 564 each by way of FIFO
circuits 551 to 554. The microprocessor units search in
parallel for a predetermined partial character string from among
four character strings (text data) and send the results back to
the microprocessor 511 by way of a data bus 521.

Since the other portions are equal to those of the first 
example, a description will be omitted. 
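
The parallel search of the second example can be sketched in software as follows. This is only an illustration under stated assumptions: a thread pool stands in for the four microprocessor units, in-memory strings stand in for the data read through the four heads, and the names find_patterns and parallel_search are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def find_patterns(text, patterns):
    """Return the partial character strings present in one text."""
    return {p for p in patterns if p in text}

def parallel_search(texts, patterns, workers=4):
    """Search several texts at once; results keep the order of texts."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda t: find_patterns(t, patterns), texts))

texts = ["full text search", "image file", "text search unit", "scanner"]
print(parallel_search(texts, ["text search", "image"]))
```

Each worker returns its own result, and the caller collects them in the same way that the microprocessor 511 collects the results over the data bus 521.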

A third illustrative example will then be taken up for
description. In the example, the hardware configuration is
the same as those of the first and second examples, but the
text searching is different.

In the case where the documents to be retrieved are
narrowed down by means of a keyword or classification
code according to a hierarchical retrieval method, the
documents screened in the process are generally distributed
unevenly among the volumes of the text file.

In the system of the example, text data is stored
redundantly in a plurality of text file volumes for multiplicity.
According to a predetermined program, CPU 401 (FIG.
4) selects a volume to access so as to even out the frequency of
access to the plurality of volumes for the texts stored
redundantly in the volumes. According to the system, all the
flexible string matching circuits operate efficiently, and a
high-speed retrieval can be realized as a whole.
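
The volume selection of the third example can be sketched as follows. The predetermined program of CPU 401 is not disclosed here, so the greedy rule below (choose the replica volume with the fewest accesses assigned so far) is an assumption of this illustration, and the names assign_volumes and replicas are hypothetical.

```python
def assign_volumes(text_ids, replicas, num_volumes):
    """Pick one volume per text so that accesses are spread evenly.

    replicas maps a text proper number to the volumes holding a copy.
    """
    load = [0] * num_volumes          # accesses assigned to each volume
    plan = {}
    for t in text_ids:
        # Least-loaded replica; ties go to the lower volume number.
        v = min(replicas[t], key=lambda vol: (load[vol], vol))
        plan[t] = v
        load[v] += 1
    return plan, load

replicas = {1: [0, 1], 2: [0, 1], 3: [0, 2], 4: [0, 2]}
plan, load = assign_volumes([1, 2, 3, 4], replicas, 3)
print(plan, load)
```

Without redundancy all four texts of this illustration would have to be read from volume 0; with it, no volume receives more than two accesses.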

In the above examples, the multiplicity of the flexible string
search circuit is 3 or 4; however, the multiplicity is not
particularly limited in the system according to the invention.

In the above description, the text search is carried out
uniformly over the whole document; however, information
on the page boundary may be recorded in the text as a
special symbol, so that the page number at which string
matching succeeds can also be output as a matching result.
Such a system is also included in the invention.
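
Outputting a page number as a matching result can be sketched as follows. The special symbol is not specified above, so the form feed character used as the page boundary marker is an assumption of this illustration, as are the names PAGE_BREAK and pages_with_match.

```python
PAGE_BREAK = "\f"   # hypothetical page boundary symbol recorded in the text

def pages_with_match(text, pattern):
    """Return the page numbers (from 1) on which pattern begins."""
    pages = []
    start = text.find(pattern)
    while start != -1:
        # Boundaries seen before the match give the page index.
        page = text.count(PAGE_BREAK, 0, start) + 1
        if page not in pages:
            pages.append(page)
        start = text.find(pattern, start + 1)
    return pages

text = "image file\ftext search here\fno match\ftext search again"
print(pages_with_match(text, "text search"))
```

In this sketch a partial character string that straddles a page boundary is not matched, since the boundary symbol interrupts it.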

Further, the description has been given for an English text;
however, the system can also be applied likewise to other
languages.

In the above example, the text data is extracted through
character recognition; however, the mode of text
content retrieval is evidently applicable also to text data
inputted by hand, which is included in the invention.

Further, the system has been described as illustrated
in FIG. 4; however, it remains substantially unchanged in the
case of a miniature system or stand-alone system, which is
also included in the invention. In particular, it is conceivable
that a text file and an image file provided in another system
be loaded to a small scale retrieval station, which is included
in the invention.

Still further, it goes without saying that retrieval conditions
can be combined through a logical operator or
extended so as to retrieve partial character strings satisfying
a relative positional relation. Particularly, a combinative
retrieval can be realized at high speed through
postprocessing by outputting the presence of each of a
plurality of partial character strings.
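
The postprocessing mentioned above can be sketched as follows, assuming that the string matching stage reports, for each text, whether each partial character string is present. The names presence and matches, and the nested tuple form of the retrieval condition, are assumptions of this illustration.

```python
def presence(text, patterns):
    """Presence flag of each partial character string in one text."""
    return {p: (p in text) for p in patterns}

def matches(flags, cond):
    """Evaluate a condition such as ("and", "a", ("or", "b", "c"))."""
    if isinstance(cond, str):
        return flags[cond]
    op, *args = cond
    results = [matches(flags, a) for a in args]
    if op == "and":
        return all(results)
    if op == "or":
        return any(results)
    if op == "not":
        return not results[0]
    raise ValueError("unknown operator: " + op)

flags = presence("document image retrieval", ["image", "text", "retrieval"])
print(matches(flags, ("and", "retrieval", ("or", "image", "text"))))
```

Because only the presence flags are combined, the text itself is scanned once regardless of how many logical conditions are evaluated afterwards.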

As described above, according to the system of the
invention, a desired document can be retrieved at high speed
by referring to the contents of the document text, and can also
be retrieved efficiently from a viewpoint which was not
conceivable at the time the document was registered.
Particularly at the time of registration, there is no necessity
for worrying excessively about what is suitable to put as a
classification code or keyword. Retrieval precision can be
enhanced consequently and the occurrence of noise can be
suppressed at the same time.

Further, a text can be retrieved at high speed by
juxtaposing the text search subsystem internally. A high speed
operation can be attained particularly by adding a string
matching circuit at every reading head.

In the case of a retrieval for a large scale document file,
the text contents can be retrieved after decreasing the documents
to be retrieved according to a keyword and bibliographical
items, thus realizing an efficient retrieval as a whole.

In the prior art, for obtaining text data from document images, a
document recognition result had to be inspected on
each occasion to correct errors; however, no
attendant is particularly required therefor according to the
invention. The text content retrieval has not been substantially
realized hitherto for the reason mentioned above, but
an effective text content retrieval can be secured by the
invention.

What is claimed is:

1. A document storage and retrieval system for storing and 
retrieving textual documents, comprising: 

image file means for storing textual documents which are
digital image data, said textual documents including
bibliographic items providing bibliographic information
of said textual documents and body text data
providing data of text found in bodies of said textual
documents;

document recognition means, coupled to said image file
means, for recognizing said textual documents, said
document recognition means includes:

(a) means for extracting pattern elements forming character
patterns from said digital image data,

(b) a document knowledge file for storing regulations
of a layout of said bibliographic items in said textual
documents as document knowledge,

(c) character segmentation means for extracting character
patterns by analyzing said pattern elements
with reference to said document knowledge in said
document knowledge file, and

(d) recognition means for recognizing said extracted
character patterns, said recognition means outputs a
recognition result including said bibliographic items
and said body text data with a layout structure name
corresponding to the recognition result;

data base file means, coupled to said document recognition
means, for storing said bibliographic items and
information as bibliographic information of said outputted
recognition result with said layout structure
name;

text file means, coupled to said document recognition
means, for storing at least said body text data as
document contents of recognized textual documents;

input means for inputting a request of a search keyword;

retrieval means, coupled to said image file means, said
data base file means, said text file means and said input
means, for retrieving digital image data of at least one
textual document which includes said search keyword
based on said stored bibliographic information and said
stored body text data; and

output means, coupled to said retrieval means, for outputting
said retrieved digital image data of at least one
textual document.

2. A document storage and retrieval system according to
claim 1, wherein said bibliographic items each include a
title, an author's name or classification of a document.

3. A document storage and retrieval method for storing 
and retrieving textual documents, comprising the steps of: 

storing textual documents which are digital image data,
said textual documents including bibliographic items
providing bibliographic information of said textual
documents and body text data providing data of text
found in bodies of said textual documents;

recognizing said textual documents, said recognizing step
includes the steps of:

(a) extracting pattern elements forming character patterns
from said digital image data,

(b) storing structural regulations of a layout of said
bibliographic items in said textual documents as
document knowledge,

(c) extracting character patterns by analyzing said
pattern elements with reference to said document
knowledge, and

(d) recognizing said extracted character patterns, and
outputting a recognition result including said bibliographic
items and said body text data with a layout
structure name corresponding to the recognition
result;

storing said bibliographic items and information as bibliographic
information of said outputted recognition
result with said layout structure name;

storing at least said body text data as document contents 
of recognized textual documents; 

inputting a request of a search keyword; 

retrieving digital image data of at least one textual document
which includes said search keyword based on said
stored bibliographic information and said stored body
text data; and

outputting said retrieved digital image data of at least one
document.

4. A document storage and retrieval method according to
claim 3, wherein said bibliographic items each include a
title, an author's name or classification of a document.

5. A document storage and retrieval system for storing and
retrieving textual documents, comprising:

an image file storing textual document image data, said
textual documents including bibliographic items providing
bibliographic information of said textual document
image data and body text data providing data of
text found in bodies of said textual documents image
data;

means for extracting pattern elements forming character 
patterns from said textual document image data; 

a document knowledge file storing structural regulations
of a layout of bibliographic items in said textual
document image data as document knowledge, according
to each kind of textual document;

means for extracting subsets of pattern elements that
constitute each bibliographic item, from said extracted
pattern elements with reference to said document
knowledge, and adding a name of a bibliographic item
corresponding to said extracted subset of pattern elements
to said extracted subset of pattern elements;

means for recognizing character patterns as extracted 
pattern elements and generating a string of character 
codes corresponding to said extracted subset of pattern 
elements that constitutes a bibliographic item; 

a text file storing said string of character codes when said 
string of character codes corresponds to document 
contents; 

a data base file storing said string of character codes when 
said string of character codes corresponds to biblio- 
graphic information; 
means for inputting a request of a search keyword; and 
means for retrieving textual document image data of at 
least one textual document which includes a string of 
character codes corresponding to said search keyword 
based on strings of character codes stored in said text 
file and said data base file. 

6. A document storage and retrieval system according to
claim 5, wherein said bibliographic items are predetermined
items of document attributes, including a title, an author's
name and classification of a document.

7. A document storage and retrieval system according to 
claim 5, wherein said data base file stores strings of char- 
acter codes corresponding to predetermined bibliographic 
items representing bibliographic information; and 

wherein said text file stores strings of character codes 
corresponding to predetermined bibliographic items 
representing document contents. 

8. A document storage and retrieval system according to 
claim 5, further comprising: 

means for outputting a textual document image corre- 
sponding to said retrieved at least one textual document 
from said image file. 

9. A document storage and retrieval system according to 
claim 5, further comprising: 

a scanner reading an image of a textual document opti- 
cally and generating said textual document image data. 

10. In a document storage and retrieval system which 
holds data of textual documents in the form of an image and 
text, and retrieves textual document image data of at least 
one textual document which includes an inputted search 
keyword based on said data of documents in the form of text, 
a document storage method comprising the steps of: 

reading textual document image data of textual documents
in the form of an image, said textual document
data including bibliographic items providing bibliographic
information of said textual document image
data and body text data providing data of text found in
bodies of said textual documents image data;

extracting pattern elements forming character patterns 
from said textual document image data; 

extracting subsets of pattern elements that constitute each 
of a plurality of said bibliographic items, from said 
extracted pattern elements, with reference to structural 
regulations of a layout of said bibliographic items in 
said textual document image data according to each 
kind of textual document; 

adding a name of a bibliographic item corresponding to 
said extracted subset of pattern elements to said 
extracted subset of pattern elements; 

recognizing character patterns as extracted pattern ele- 
ments; 
generating a string of character codes corresponding to
said extracted subset of pattern elements that constitute
a bibliographic item; and

storing strings of character codes in a text file when said
string of character codes corresponds to document
contents and in a data base file when said string of
character codes corresponds to bibliographic information.

11. A document storage method according to claim 10,
comprising the step of:

reading an image of a document optically and generating
said document image data.